Learning to rank[1] or machine-learned ranking (MLR) is a type of supervised or semi-supervised machine learning problem in which the goal is to automatically construct a ranking model from training data. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item. Ranking model's purpose is to rank, i.e. produce a permutation of items in new, unseen lists in a way, which is "similar" to rankings in the training data in some sense.
Learning to rank is a relatively new research area which has emerged in the past decade.
Contents |
Ranking is a central part of many information retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, computational advertising (online ad placement).
A possible architecture of a machine-learned search engine is shown in the figure to the right.
Training data consists of queries and documents matching them together with relevance degree of each match. It may be prepared manually by human assessors (or raters, as Google calls them), who check results for some queries and determine relevance of each result. It is not feasible to check relevance of all documents, and so typically a technique called pooling is used — only top few documents, retrieved by some existing ranking models are checked. Alternatively, training data may be derived automatically by analyzing clickthrough logs (i.e. search results which got clicks from users),[2] query chains,[3] or such search engines' features as Google's SearchWiki.
Training data is used by a learning algorithm to produce a ranking model which computes relevance of documents for actual queries.
Typically, users expect a search query to complete in a short time (such as a few hundred milliseconds for web search), which makes it impossible to evaluate a complex ranking model on each document in the corpus, and so a two-phase scheme is used.[4] First, a small number of potentially relevant documents are identified using simpler retrieval models which permit fast query evaluation, such as vector space model, boolean model, weighted AND,[5] BM25. This phase is called top- document retrieval and many good heuristics were proposed in the literature to accelerate it, such as using document's static quality score and tiered indexes.[6] In the second phase, a more accurate but computationally expensive machine-learned model is used to re-rank these documents.
Learning to rank algorithms have been applied in areas other than information retrieval:
For convenience of MLR algorithms, query-document pairs are usually represented by numerical vectors, which are called feature vectors. Such approach is sometimes called bag of features and is analogous to bag of words and vector space model used in information retrieval for representation of documents.
Components of such vectors are called features, factors or ranking signals. They may be divided into three groups (features from document retrieval are shown as examples):
Some examples of features, which were used in the well-known LETOR dataset:[11]
Selecting and designing good features is an important area in machine learning, which is called feature engineering.
There are several measures (metrics) which are commonly used to judge how well an algorithm is doing on training data and to compare performance of different MLR algorithms. Often a learning-to-rank problem is reformulated as an optimization problem with respect to one of these metrics.
Examples of ranking quality measures:
DCG and its normalized variant NDCG are usually preferred in academic research when multiple levels of relevance are used.[12] Other metrics such as MAP, MRR and precision, are defined only for binary judgements.
Recently, there have been proposed several new evaluation metrics which claim to model user's satisfaction with search results better than the DCG metric:
Both of these metrics are based on the assumption that the user is more likely to stop looking at search results after examining a more relevant document, than after a less relevant document.
Tie-Yan Liu of Microsoft Research Asia in his paper "Learning to Rank for Information Retrieval"[1] and talks at several leading conferences has analyzed existing algorithms for learning to rank problems and categorized them into three groups by their input representation and loss function:
In this case it is assumed that each query-document pair in the training data has a numerical or ordinal score. Then learning-to-rank problem can be approximated by a regression problem — given a single query-document pair, predict its score.
A number of existing supervised machine learning algorithms can be readily used for this purpose. Ordinal regression and classification algorithms can also be used in pointwise approach when they are used to predict score of a single query-document pair, and it takes a small, finite number of values.
In this case learning-to-rank problem is approximated by a classification problem — learning a binary classifier which can tell which document is better in a given pair of documents. The goal is to minimize average number of inversions in ranking.
These algorithms try to directly optimize the value of one of the above evaluation measures, averaged over all queries in the training data. This is difficult because most evaluation measures are not continuous functions with respect to ranking model's parameters, and so continuous approximations or bounds on evaluation measures have to be used.
A partial list of published learning-to-rank algorithms is shown below with years of first publication of each method:
Year | Name | Type | Notes |
---|---|---|---|
2000 | Ranking SVM (RankSVM) | pairwise | A more recent exposition is in,[2] which describes an application to ranking using clickthrough logs. |
2002 | Pranking | pointwise | Ordinal regression. |
2003 | RankBoost | pairwise | |
2005 | RankNet | pairwise | |
2006 | IR-SVM | pairwise | Ranking SVM with query-level normalization in the loss function. |
2006 | LambdaRank | listwise | RankNet in which pairwise loss function is multiplied by the change in the IR metric caused by a swap. |
2007 | AdaRank | listwise | |
2007 | FRank | pairwise | Based on RankNet, uses a different loss function - fidelity loss. |
2007 | GBRank | pairwise | |
2007 | ListNet | listwise | |
2007 | McRank | pointwise | |
2007 | QBRank | pairwise | |
2007 | RankCosine | listwise | |
2007 | RankGP | listwise | |
2007 | RankRLS | pairwise |
Regularized least-squares based ranking. The work is extended in [15] to learning to rank from general preference graphs. |
2007 | SVMmap | listwise | |
2008 | LambdaMART | listwise | Winning entry in the recent Yahoo Learning to Rank competition used an ensemble of LambdaMART models.[16] |
2008 | ListMLE | listwise | Based on ListNet. |
2008 | PermuRank | listwise | |
2008 | SoftRank | listwise | |
2008 | Ranking Refinement[17] | pairwise | A semi-supervised approach to learning to rank that uses Boosting. |
2008 | SSRankBoost[18] | pairwise | An extension of RankBoost to learn with partially labeled data (semi-supervised learning to rank) |
2008 | SortNet[19] | pairwise | SortNet, an adaptive ranking algorithm which orders objects using a neural network as a comparator. |
2009 | MPBoost | pairwise | Magnitude-preserving variant of RankBoost. The idea is that the more unequal are labels of a pair of documents, the harder should the algorithm try to rank them. |
2009 | BoltzRank | listwise | Unlike earlier methods, BoltzRank produces a ranking model that looks during query time not just at a single document, but also at pairs of documents. |
2009 | BayesRank | listwise | Based on ListNet. |
2010 | NDCG Boost[20] | listwise | A boosting approach to optimize NDCG. |
2010 | GBlend | pairwise | Extends GBRank to the learning-to-blend problem of jointly solving multiple learning-to-rank problems with some shared features. |
2010 | IntervalRank | pairwise & listwise | |
2010 | CRR | pointwise & pairwise | Combined Regression and Ranking. Uses stochastic gradient descent to optimize a linear combination of a pointwise quadratic loss and a pairwise hinge loss from Ranking SVM. |
Note: as most supervised learning algorithms can be applied to pointwise case, only those methods which are specifically designed with ranking in mind are shown above.
C. Manning et al.[21] trace earliest works on learning to rank problem to papers in late 1980s and early 1990s. They suggest that these early works achieved limited results in their time due to little available training data and poor machine learning techniques.
In mid-1990s Berkeley researchers used logistic regression to train a successful ranking function at TREC conference.
Several conferences, such as NIPS, SIGIR and ICML had workshops devoted to the learning-to-rank problem since mid-2000s, and this has stimulated much of academic research.
Commercial web search engines began using machine learned ranking systems since 2000s. One of the first search engines to start using it was AltaVista (later its technology was acquired by Overture, and then Yahoo), which launched a gradient boosting-trained ranking function in April 2003.[22][23]
Bing's search is said to be powered by RankNet algorithm,[24] which was invented at Microsoft Research in 2005.
In November 2009 a Russian search engine Yandex announced[25] that it had significantly increased its search quality due to deployment of a new proprietary MatrixNet algorithm, a variant of gradient boosting method which uses oblivious decision trees.[26] Recently they have also sponsored a machine-learned ranking competition "Internet Mathematics 2009"[27] based on their own search engine's production data. Yahoo has announced a similar competition in 2010.[28]
As of 2008, Google's Peter Norvig denied that their search engine exclusively relies on machine-learned ranking.[29] Cuil's CEO, Tom Costello, suggests that they prefer hand-built models because they can outperform machine-learned models when measured against metrics like click-through rate or time on landing page, which is because machine-learned models "learn what people say they like, not what people actually like".[30]